BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0) #61909

jorisvandenbossche · 2025-07-19T18:28:34Z

I ran into one more case where the sum of an empty / all-NaN group uses "0" instead of an empty string (#60229): specifically, when empty groups are effectively introduced by grouping on categorical data with observed=False.

Follow-up on #60936

Passing is_string down through several layers is a bit annoying, but it is needed so that, for now, this only changes for string dtype and not for object dtype in general (which is what we did, for now, in the other PR related to this).

jbrockmendel · 2025-07-19T23:49:54Z

pandas/_libs/groupby.pyx

@@ -729,6 +730,10 @@ def group_sum(
    sumx = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype)
    compensation = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype)

+    if is_string:
+        # for strings start with empty string instead of 0 as `initial` value
+        sumx = np.full((<object>out).shape, "", dtype=object)


Would passing “initial” be more general/idiomatic?

jorisvandenbossche · 2025-07-20

I was initially thinking that as well, but because this would in practice only be used for the specific case of strings, I wanted to be more explicit about that. In both cases I have to pass it down a few layers, though, so whether it is called initial or is_string does not matter much, and initial at least makes the purpose of passing it down clearer at the place where it is called, in the EA _groupby_op.
So I updated it to use initial in the last commit.
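
To illustrate the effect of an `initial` fill value discussed above, here is a minimal pure-Python sketch. It is not the Cython `group_sum` from pandas/_libs/groupby.pyx; the function below, its signature, and the `seen` bookkeeping are simplified assumptions made only for illustration. It shows why a group with no rows (an unobserved category) ends up holding whatever the accumulator was initialized with.

```python
def group_sum(values, codes, ngroups, initial=0):
    """Sum `values` per group; `codes[i]` is the group index of `values[i]`.

    Hypothetical, simplified stand-in for a grouped sum kernel.
    """
    out = [initial] * ngroups
    seen = [False] * ngroups
    for value, code in zip(values, codes):
        if code < 0 or value is None:
            # Skip rows that belong to no group and missing values.
            continue
        if seen[code]:
            out[code] = out[code] + value
        else:
            # The first value of a group replaces the initial fill value.
            out[code] = value
            seen[code] = True
    return out


# Categories "a", "b", "c" map to codes 0, 1, 2; "b" is never observed.
values = ["x", "y", "z"]
codes = [0, 0, 2]

print(group_sum(values, codes, ngroups=3))              # ['xy', 0, 'z']
print(group_sum(values, codes, ngroups=3, initial=""))  # ['xy', '', 'z']
```

With the default `initial=0`, the empty group keeps the integer 0 in an otherwise string result; passing `initial=""` makes the empty group an empty string instead.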

rhshadrach

lgtm

lumberbot-app · 2025-07-22T07:20:27Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 27928edc61f5b01e933036a99549636425e5a557

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #61909: BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0)'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-61909-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #61909 on branch 2.3.x (BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0))"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

jorisvandenbossche · 2025-07-26T12:24:38Z

Manual backport -> #61963

BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0)

ddbc8ec

jorisvandenbossche added this to the 2.3.2 milestone Jul 19, 2025

jorisvandenbossche requested a review from rhshadrach as a code owner July 19, 2025 18:28

jorisvandenbossche added the Bug, Groupby, and Strings (String extension data type and string data) labels Jul 19, 2025

fix one more test

6a32c83

jbrockmendel reviewed Jul 19, 2025

jorisvandenbossche added 2 commits July 20, 2025 12:47

use initial instead + fix test for non-infer mode

88501c6

fix typing

b0013bd

mroeschke approved these changes Jul 21, 2025

rhshadrach approved these changes Jul 21, 2025

jorisvandenbossche merged commit 27928ed into pandas-dev:main Jul 22, 2025
43 checks passed

jorisvandenbossche deleted the string-dtype-groupby-sum-observed-false-fillvalue branch July 22, 2025 07:20

lumberbot-app bot added the Still Needs Manual Backport label Jul 22, 2025

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jul 26, 2025

BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0) (pandas-dev#61909)

7f6206c

jorisvandenbossche mentioned this pull request Jul 26, 2025

[backport 2.3.x] BUG: fix fill value for gouped sum in case of unobserved categories for string dtype (empty string instead of 0) (#61909) #61963

Open

jorisvandenbossche removed the Still Needs Manual Backport label Jul 26, 2025
